Finding “It”: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos
Abstract
Grounding textual phrases in visual content with standalone image-sentence pairs is a challenging task. When we consider grounding in instructional videos, the problem becomes profoundly more complex: the latent temporal structure of instructional videos breaks independence assumptions and necessitates contextual understanding to resolve ambiguous visual-linguistic cues. Furthermore, the cost of dense annotation at video scale makes fully supervised approaches prohibitive. In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision. We introduce the visually grounded action graph, a structured representation capturing the latent dependencies between grounding and references in video. For optimization, we propose a new reference-aware multiple instance learning (RA-MIL) objective for weak supervision of grounding in videos. We evaluate our approach on unconstrained videos from YouCookII and RoboWatch, augmented with new reference-grounding test-set annotations. We demonstrate that our jointly optimized, reference-aware approach simultaneously improves visual grounding, reference resolution, and generalization to unseen instructional video categories.
Similar References
Knowledge Aided Consistency for Weakly Supervised Phrase Grounding
Given a natural language query, a phrase grounding system aims to localize the mentioned objects in an image. In the weakly supervised scenario, the mapping between image regions (i.e., proposals) and language is not available in the training set. Previous methods address this deficiency by training a grounding system via learning to reconstruct language information contained in input queries from predicte...
Unsupervised Learning and Segmentation of Complex Activities from Video
This paper presents a new method for unsupervised segmentation of complex activities from video into multiple steps, or sub-activities, without any textual input. We propose an iterative discriminative-generative approach which alternates between discriminatively learning the appearance of sub-activities from the videos’ visual features to sub-activity labels and generatively modelling the temp...
Semi-supervised Learning for Identifying Players from Broadcast Sports Videos with Play-by-Play Information
Tracking and identifying players in sports videos filmed with a single moving pan-tilt-zoom camera has many applications, but it is also a challenging problem due to fast camera motions, unpredictable player movements, and unreliable visual features. Recently, [26] introduced a system to tackle this problem based on conditional random fields. However, their system requires a large number of lab...
Grounding Action Descriptions in Videos
Recent work has shown that integrating visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general-purpose corpus that aligns high-qualit...
Context-aware Visual Analysis of Elderly Activity in a Cluttered Home Environment
This paper presents a semi-supervised methodology for automatic recognition and classification of elderly activity in a cluttered real home environment. The proposed mechanism recognizes elderly activities by using a semantic model of the scene under visual surveillance. We also illustrate the use of trajectory data for unsupervised learning of this scene context model. The model learning proce...